- Home
- Search Results
- Page 1 of 1
Search for: All records
- 
                                    Total Resources2
- Resource Type
- 
                                    
                                    
                                    
                                    0001000001000000
- More
- Availability
- 
                                    
                                    11
- Author / Contributor
- Filter by Author / Creator
- 
                                    
                                        - 
                                                    
                                                        
                                                            
                                                            Hayase, J (2)
- 
                                                    
                                                        
                                                            
                                                            Oh, S (2)
- 
                                                    
                                                        
                                                            
                                                            Choi, Y (1)
- 
                                                    
                                                        
                                                            
                                                            Hofmann, V (1)
- 
                                                    
                                                        
                                                            
                                                            Kong, W (1)
- 
                                                    
                                                        
                                                            
                                                            Liu, A (1)
- 
                                                    
                                                        
                                                            
                                                            Smith, N A (1)
- 
                                                    
                                                        
                                                            
                                                            Somani, R (1)
- 
                                                    
                                                        
                                                            
                                                            #Tyler Phillips, Kenneth E. (0)
- 
                                                    
                                                        
                                                            
                                                            #Willis, Ciara (0)
- 
                                                    
                                                        
                                                            
                                                            & Abreu-Ramos, E. D. (0)
- 
                                                    
                                                        
                                                            
                                                            & Abramson, C. I. (0)
- 
                                                    
                                                        
                                                            
                                                            & Abreu-Ramos, E. D. (0)
- 
                                                    
                                                        
                                                            
                                                            & Adams, S.G. (0)
- 
                                                    
                                                        
                                                            
                                                            & Ahmed, K. (0)
- 
                                                    
                                                        
                                                            
                                                            & Ahmed, Khadija. (0)
- 
                                                    
                                                        
                                                            
                                                            & Aina, D.K. Jr. (0)
- 
                                                    
                                                        
                                                            
                                                            & Akcil-Okan, O. (0)
- 
                                                    
                                                        
                                                            
                                                            & Akuom, D. (0)
- 
                                                    
                                                        
                                                            
                                                            & Aleven, V. (0)
 
- 
                                                    
                                                        
                                                            
                                                            
- Filter by Editor
- 
                                    
                                        - 
                                                    
                                                        
                                                            
                                                            null (1)
- 
                                                    
                                                        
                                                            
                                                            & Spizer, S. M. (0)
- 
                                                    
                                                        
                                                            
                                                            & . Spizer, S. (0)
- 
                                                    
                                                        
                                                            
                                                            & Ahn, J. (0)
- 
                                                    
                                                        
                                                            
                                                            & Bateiha, S. (0)
- 
                                                    
                                                        
                                                            
                                                            & Bosch, N. (0)
- 
                                                    
                                                        
                                                            
                                                            & Brennan K. (0)
- 
                                                    
                                                        
                                                            
                                                            & Brennan, K. (0)
- 
                                                    
                                                        
                                                            
                                                            & Chen, B. (0)
- 
                                                    
                                                        
                                                            
                                                            & Chen, Bodong (0)
- 
                                                    
                                                        
                                                            
                                                            & Drown, S. (0)
- 
                                                    
                                                        
                                                            
                                                            & Ferretti, F. (0)
- 
                                                    
                                                        
                                                            
                                                            & Higgins, A. (0)
- 
                                                    
                                                        
                                                            
                                                            & J. Peters (0)
- 
                                                    
                                                        
                                                            
                                                            & Kali, Y. (0)
- 
                                                    
                                                        
                                                            
                                                            & Ruiz-Arias, P.M. (0)
- 
                                                    
                                                        
                                                            
                                                            & S. Spitzer (0)
- 
                                                    
                                                        
                                                            
                                                            & Sahin. I. (0)
- 
                                                    
                                                        
                                                            
                                                            & Spitzer, S. (0)
- 
                                                    
                                                        
                                                            
                                                            & Spitzer, S.M. (0)
 
- 
                                                    
                                                        
                                                            
                                                            
- 
                                    Have feedback or suggestions for a way to improve these results?
 !
                                    
                                        
                                            Note: When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external site maintained by the publisher.
                                            Some full text articles may not yet be available without a charge during the embargo (administrative interval).
                                        
                                        
                                        
                                            
                                                
                                             What is a DOI Number?
                                        
                                    
                                
Some links on this page may take you to non-federal websites. Their policies may differ from this site.
- 
            The assumption across nearly all language model (LM) tokenization schemes is that tokens should be subwords, i.e., contained within word boundaries. While providing a seemingly reasonable inductive bias, is this common practice limiting the potential of modern LMs? Whitespace is not a reliable delimiter of meaning, as evidenced by multi-word expressions (e.g., "by the way"), crosslingual variation in the number of words needed to express a concept (e.g., "spacesuit helmet" in German is "raumanzughelm"), and languages that do not use whitespace at all (e.g., Chinese). To explore the potential of tokenization beyond subwords, we introduce a "superword" tokenizer, SuperBPE, which incorporates a simple pretokenization curriculum into the byte-pair encoding (BPE) algorithm to first learn subwords, then superwords that bridge whitespace. This brings dramatic improvements in encoding efficiency: when fixing the vocabulary size to 200k, SuperBPE encodes a fixed piece of text with up to 33% fewer tokens than BPE on average. In experiments, we pretrain 8B transformer LMs from scratch while fixing the model size, vocabulary size, and train compute, varying *only* the algorithm for learning the vocabulary. Our model trained with SuperBPE achieves an average +4.0% absolute improvement over the BPE baseline across 30 downstream tasks (including +8.2% on MMLU), while simultaneously requiring 27% less compute at inference time. In analysis, we find that SuperBPE results in segmentations of text that are more uniform in per-token difficulty. Qualitatively, this may be because SuperBPE tokens often capture common multi-word expressions that function semantically as a single unit. SuperBPE is a straightforward, local modification to tokenization that improves both encoding efficiency and downstream performance, yielding better language models overall.more » « lessFree, publicly-accessible full text available April 14, 2026
- 
            Hayase, J; Kong, W; Somani, R; Oh, S (, ArXivorg)null (Ed.)
 An official website of the United States government
An official website of the United States government 
				
			 
					 
					
 
                                     Full Text Available
                                                Full Text Available